
Classification task

The EUR-Lex dataset contains 25K documents, which makes it impractical to train a classifier over the whole dataset at once. We therefore divided the dataset into 24 subsets of equal size and trained the classifiers on each subset separately. The final evaluation results are the averages of the separately reported results for each subset. The dataset has around 7000 labels; an older version had only 4000. To observe the effect of a large number of labels on the classification task, we compared the predictive performance of the models when trained on the dataset with all 7000 labels and when trained on a dataset with a reduced number of labels.

Classification Models

We use three methods to transform the multi-label classification task into a conventional multi-class classification task: binary relevance, label powerset, and classifier chain. After transforming the problem, we trained three different classification models: random forest, k nearest neighbours, and XGBoost.

Classifier models
K nearest neighbours
Random Forest
XGBoost
Transformation methods
Binary relevance
Classifier chain
Label powerset

We trained nine classification models and compared their performance to identify the model that best fits the dataset and the type of features that yields the best predictive performance. The following table shows the nine classifier models:

KNN-Label Powerset RF-Label Powerset XGBoost-Label Powerset
KNN-Binary Relevance RF-Binary Relevance XGBoost-Binary Relevance
KNN-Classifier Chain RF-Classifier Chain XGBoost-Classifier Chain
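The nine combinations in the table above can be enumerated programmatically; a small base-R sketch (the variable names are illustrative, not part of the experiment code):

```r
# Cross the three transformation methods with the three base classifiers
transformations <- c("Label Powerset", "Binary Relevance", "Classifier Chain")
classifiers <- c("KNN", "RF", "XGBoost")
grid <- expand.grid(classifier = classifiers, transformation = transformations,
                    stringsAsFactors = FALSE)
models <- paste(grid$classifier, grid$transformation, sep = "-")
models  # nine model names, e.g. "KNN-Label Powerset"
```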

Experimental settings

As mentioned earlier, the 25K-document dataset was split into 24 subsets, and the previously mentioned classification models were trained separately on each in order to compare their performance. Each subset was split randomly into two disjoint subsets, one for training and one for testing (65% for training and 35% for testing). We report the results of the different models under different settings: we wanted to explore the performance of the classifiers with two types of features, two languages, and two numbers of labels. The following table summarizes the experimental settings:
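The 65/35 holdout split performed by create_holdout_partition() can be sketched in base R as follows (this illustrates the idea only, not utiml's actual implementation):

```r
set.seed(42)                       # reproducible split (illustrative seed)
n <- 100                           # documents in one subset
idx <- sample(n, size = round(0.65 * n))  # draw 65% of indices at random
train_ids <- idx
test_ids  <- setdiff(seq_len(n), idx)     # remaining 35%, disjoint from training
c(train = length(train_ids), test = length(test_ids))
```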

Language
English
German
Features
TF-IDF
Incidence
Number of Labels
7000
<7000

The following code shows the functions used for classification; we use the utiml package. To split the dataset into training and testing sets we use the function create_holdout_partition(). The example code is shown for the binary relevance method; we simply change the method used to one of the following: lp() for label powerset, br() for binary relevance, or cc() for classifier chain. We had to destroy each model with rm() after exporting the performance results to avoid running out of memory.

library(mldr)
library(utiml)
library(magrittr)  ## provides the %>% pipe used below

train_ratio <- 0.65
test_ratio <- 0.35
iteration <- 24

for(index in 1:iteration){

  ds <- mldr(paste(generic_name,index,sep = "")) %>%
  remove_skewness_labels(1) %>%
  remove_attributes("...") %>%
  remove_unique_attributes() %>%
  remove_unlabeled_instances() %>%
  create_holdout_partition(c(train=train_ratio, test=test_ratio))
  
  ## KNN - K nearest neighbour
  brmodel1 <- br(ds$train, "KNN")
  prediction1 <- predict(brmodel1, ds$test)
  temp_knn <- multilabel_evaluate(ds$test, prediction1, "bipartition")

  ##remove model of memory
  rm(brmodel1)
  rm(prediction1)
  
  ## RF - Random Forest
  brmodel2 <- br(ds$train, "RF")
  prediction2 <- predict(brmodel2, ds$test)
  temp_rf <- multilabel_evaluate(ds$test, prediction2, "bipartition")

  rm(brmodel2)
  rm(prediction2)

  ## XGB - eXtreme Gradient Boosting
  brmodel3 <- br(ds$train, "XGB")
  prediction3 <- predict(brmodel3, ds$test)
  temp_xgb <- multilabel_evaluate(ds$test, prediction3, "bipartition")

  rm(brmodel3)
  rm(prediction3)
  ## collect per-subset results: one column per subset
  if(index == 1){
    knn <- temp_knn
    rf <- temp_rf
    xgb <- temp_xgb
  }else{
    knn <- cbind(knn, temp_knn)
    rf <- cbind(rf, temp_rf)
    xgb <- cbind(xgb, temp_xgb)
  }
}
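As stated in the classification-task section, the final evaluation is the average over the 24 subsets. Since each cbind() call in the loop above appends one column of metric values per subset, the final figures can be obtained with a row-wise mean; a minimal sketch with a mock result matrix (the metric values are illustrative only, not real results):

```r
## Mock per-subset results: rows are metrics, columns are subsets
knn <- cbind(c(accuracy = 0.61, `macro-F1` = 0.42),
             c(accuracy = 0.63, `macro-F1` = 0.44))
final_knn <- rowMeans(knn)   # average each metric over the subsets
final_knn
```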

Plotting Graphs Functions

We used the ggplot2 package to display the evaluation-metric graphs: the function ggplot() plots the graphs and ggsave() saves the last plotted graph.

ggplot(performance) +
  ylab("value") + xlab("model") +
  scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1.0)) +
  geom_point(aes(x = model, y = value, color = metric), size = 5, alpha = 0.7) +
  facet_grid(metric ~ .) +
  coord_flip() +
  theme_bw()
ggsave("/multilabelclassification/graphs/cc_EN_tfidf1.png", width = 8, height = 7, dpi = 400)

ggplot(performance) +
  geom_tile(aes(x = model, y = metric, fill = value), color = "black") +
  geom_text(aes(x = model, y = metric, label = round(value, 3)), color = "black") +
  scale_fill_distiller(palette = "Spectral")
ggsave("/multilabelclassification/graphs/cc_EN_tfidf2.png", width = 6, height = 4, dpi = 400)

Experimental Results

We tested the nine models over two languages (English and German) and with two types of features (TF-IDF and term incidence).
For the evaluation task, the mldr package equipped us with various measurement tools. We present all of them for each experiment; however, we chose the macro F1 measurement (a measure that combines precision and recall) to compare the predictive performance of all classifiers. Exploration of the EUR-Lex dataset revealed the class labels to be imbalanced (i.e. some labels are frequent and some are infrequent). In that case accuracy is a misleading measure of performance, so we instead consider macro F1 as the comparison factor among the classifiers. In general, across all the experiments, we found that combining the label powerset transformation method with the random forest classifier produced the best performance, whereas combining label powerset with the XGBoost classifier resulted in the worst performance.
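Macro F1 averages the per-label F1 scores, so infrequent labels weigh as much as frequent ones, which is why it is preferred over accuracy here. A minimal base-R sketch of the computation (the helper macro_f1() and the toy matrices are our own illustration, not functions from mldr or utiml):

```r
macro_f1 <- function(truth, pred) {
  # truth, pred: binary label matrices (instances x labels)
  f1_per_label <- sapply(seq_len(ncol(truth)), function(j) {
    tp <- sum(truth[, j] == 1 & pred[, j] == 1)
    fp <- sum(truth[, j] == 0 & pred[, j] == 1)
    fn <- sum(truth[, j] == 1 & pred[, j] == 0)
    if (2 * tp + fp + fn == 0) return(0)  # undefined F1 treated as 0
    2 * tp / (2 * tp + fp + fn)
  })
  mean(f1_per_label)  # equal weight per label, regardless of frequency
}

truth <- matrix(c(1, 0, 1,  0, 1, 0), nrow = 3)  # 3 instances, 2 labels
pred  <- matrix(c(1, 0, 0,  0, 1, 0), nrow = 3)
macro_f1(truth, pred)  # mean of per-label F1: (2/3 + 1) / 2
```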

Language: English, Nr of Labels: 7000 Labels

For the English dataset, we observed higher macro F1 rates over all nine trained classifiers when we used TF-IDF values as the instance features: TF-IDF is a more powerful representation than simply using the incidence of terms. The experiments show that label powerset combined with random forest recorded the best result for both types of features (TF-IDF and incidence), whereas label powerset combined with XGBoost performed the worst, as shown in the following figures:

In the following sections we will display the results in detail for each feature separately.

Language: English, Feature: TF-IDF, Nr of Labels: 7000 Labels

Label Powerset

According to the following figures, label powerset with the k-nearest-neighbours and random forest classifiers performed the best, whereas with XGBoost the performance was the worst.

Binary Relevance

We compared the three models in which the labels are assumed to be independent. Binary relevance produced the best classifier when combined with XGBoost.

Classifier Chains

Similar to the Binary Relevance method, Classifier Chain performed the best with the XGBoost classifier.

Language: English, Feature: Incidence, Nr of Labels: 7000 Labels

We also wanted to test the performance of the models when employing more naive features such as the incidence of terms. For the English language, the experiments showed a similar pattern to the TF-IDF case: label powerset with the random forest classifier scored the highest macro F1 value. Nevertheless, using the incidence of terms as features did not yield better performance than the TF-IDF features.

Label Powerset
Binary Relevance
Classifier Chains

Language: German, Nr of Labels: 7000 Labels

Unlike the earlier results of the experiments run on the English dataset, for the German dataset we observed higher macro F1 rates over all nine trained classifiers when we used term incidences instead of TF-IDF values as the instance features. The experiments show that label powerset combined with both random forest and k nearest neighbours recorded the best results for both types of features (TF-IDF and incidence), compared to the low performance produced by label powerset combined with XGBoost, as shown in the following figures:

In the following sections we display the results of the experiments run on the German dataset in detail for each feature separately.

Language: German, Feature: TF-IDF, Nr of Labels: 7000 Labels

Extracting TF-IDF values is computationally expensive. Comparing the low performance of these classifiers to the better performance of the classifiers trained on the incidence of terms led us to qualify term incidences as the more effective features for training the classifiers on the German dataset.

Label Powerset

Language: German, Feature: Incidence, Nr of Labels: 7000 Labels

Label Powerset
Binary Relevance

Dataset with reduced number of labels

Training our classification models on a dataset with such a large number of labels (7000) was a challenging task. We expected that reducing the number of labels would improve the predictive capacity of the classifiers. To reduce the number of labels, we take advantage of the "scumble" measure provided with the mldr package, which indicates the level of concurrence among frequent and infrequent labels. We removed majority labels by keeping only the labelsets with SCUMBLE values lower than the mean SCUMBLE value of the dataset, using the following command:

datasetWithReducedNrLabels <- dataset[.SCUMBLE <= dataset$measures$scumble]

Language: English, Nr of Labels: less than 7K labels

Pruning the label sets by removing imbalanced label sets improved the performance across all classifiers. Although employing TF-IDF features yielded higher macro F1 for classifiers trained on the complete set of labels, training classifiers on term incidences in the case of the balanced label sets (reduced number of labels) showed slightly better performance, as shown in the following graphs:

Language: English, Feature: TF-IDF, Nr of Labels: Reduced number of labels

For the TF-IDF features, most of the classifiers trained on the balanced label sets (reduced number of labels) maintained almost the same levels of performance as with the complete set of labels.

Label Powerset
Binary Relevance

Language: English, Feature: Incidence, Nr of Labels: Reduced number of labels

For the balanced label sets, we observed better performance with the incidence features for all classifiers compared to the performance with the complete label sets. Label powerset combined with the random forest classifier outperformed the same classification model trained on the complete set with TF-IDF features. It can be said that for more balanced label sets, the term incidence features were sufficient even for the English dataset.

Label Powerset
Binary Relevance

Language: German, Nr of Labels: Reduced number of labels

Removing the imbalanced labels enhanced the performance significantly for the German dataset.

Language: German, Feature: TF-IDF, Nr of Labels: Reduced number of labels

After removing the imbalanced labels, the macro F1 value of label powerset with the random forest classifier increased by around 15%.

Language: German, Feature: Incidence, Nr of Labels: Reduced number of labels

Conclusion

The label powerset method generally produced the best accuracy compared to the binary relevance method. Binary relevance assumes the labels are independent, whereas label powerset assumes that labels co-occurring in the same instance are correlated. In addition, we have seen that removing the imbalanced label sets improved the performance, which supports the earlier observation that the label sets of the EUR-Lex dataset are not homogeneously distributed.
